not met, then the computation of any type of difference score is inappropriate
and  the  scores  themselves  are  useless  for  measuring  growth  or  change. 

Two  studies  investigated  the  tenability  of  the  assumption  that  classroom  in¬ 
struction  results  in  increases  in  students'  achievement  levels  while  the  qual¬ 
itative  nature  of  that  achievement  remains  constant  across  time.  The  data 
utilized  were  the  item  responses  to  tests  in  basic  mathematics  and  in  general 
biology  administered  as  pretests  and  after  instruction  to  students  enrolled  in 
those  courses. 

Results indicated that this assumption was not tenable in the biology data
set,  where  increases  in  mean  achievement  level  were  accompanied  by  correspond¬ 
ing  changes  in  the  factor  structure  underlying  the  item  responses.  For  the 
mathematics  data,  however,  there  was  no  such  violation  of  the  assumption:  As 
student  achievement  levels  increased  the  underlying  factor  structure  remained 
unchanged.  The  implications  of  these  results  for  psychology,  education,  and 
program  evaluation  are  noted. 
Dimensionality of Measured Achievement Over Time


The  measurement  of  individual  or  group  change  is  central  to  many  issues  in 
the  fields  of  psychology,  education,  and  program  evaluation.  Psychologists, 
educators,  and  (more  recently)  evaluators  typically  use  differences  in  test 
scores  to  quantify  the  effects  of  experimental  treatments  and  educational  pro¬ 
grams  on  individuals  and  on  groups  of  individuals. 

The  typical  paradigm  for  measuring  change  involves  the  administration  of  a 
standardized  achievement  test  both  before  and  after  an  experimental  treatment  or 
program  implementation;  the  effect  of  the  treatment  intervention  is  then  consid¬ 
ered  to  be  a  function  of  the  mean  difference  between  the  two  sets  of  test 
scores.  If  two  or  more  groups  of  students  are  involved,  comparisons  can  also  be 
made  between  treatment  and  control  groups,  or  among  groups  exposed  to  various 
treatments  or  involved  in  several  different  programs.  Again,  evaluation  of 
treatment  effects  involves  comparing  the  mean  achievement  gain  (typically,  a 
function  of  the  difference  scores)  observed  for  each  group.  Individual  gain  or 
change  is  also  frequently  used  to  measure  an  individual's  growth  in  achievement 
level  or  change  due  to  a  treatment  or  special  program. 

Lord  (1963)  and  Cronbach  and  Furby  (1970),  among  others,  have  discussed  the 
methodological and statistical problems involved in using difference scores to
measure  change  or  growth  and  have  presented  some  possible  solutions.  Whether 
measurements  of  change  involve  the  use  of  simple  difference  scores,  their  deriv¬ 
atives,  or  some  more  complex  methodological  design,  the  measurement  process  it¬ 
self  assumes  that  the  treatment  or  instruction  results  in  increased  levels  of 
the  same  trait  or  characteristic  that  was  measured  originally  and  that  the  only 
change  that  occurs  is  a  quantitative  one. 

That  this  assumption  may  be  violated  has  long  been  evident  in  studies  of 
intelligence  and  intellectual  growth.  Garrett  (1946)  noted  that  "intelligence 
changes  in  its  organization"  (p.  373)  and  called  for  corresponding  changes  in 
the  way  intelligence  is  measured.  This  "differentiation  hypothesis"  spawned 
much  research  (see  Reinert,  1970,  for  a  review)  concerning  the  changes  in  the 
structure  and  organization  of  intelligence  throughout  the  human  life  span.  Some 
of  these  studies  report  results  supporting  the  hypothesis  of  age  differentia¬ 
tion;  others  offer  support  for  a  hypothesis  of  age  integration,  and  still  others 
provide  evidence  in  support  of  both  these  hypotheses.  Nearly  all  this  research, 
however,  has  found  that  the  structure  of  intelligence,  as  defined  by  factor 
analysis,  does  not  remain  constant  with  age  and  experience. 

Other  authors  (Anastasi,  1936;  Ferguson,  1954;  Games,  1962;  Woodrow,  1938, 
1939a,  1939b,  1939c)  have  investigated  the  changes  in  verbal  ability  and  intel¬ 
lectual  factor  structure  that  accompany  shorter  term  training  and  practice. 
Similar factor-analytic investigations have been made in the areas of psychomotor
behavior (Fleishman, 1953, 1957, 1960; Fleishman & Hempel, 1954, 1955;
Greene,  1943),  psycholinguistic  abilities  (Querishi,  1967),  word  association 
(Sullivan  &  Moran,  1967;  Swartz  &  Moran,  1968),  and  even  the  learning  of  Morse 
code (Fleishman & Fruchter, 1960). All these authors have found that the factorial
structure of abilities underlying task performance changes in a systematic
way  with  training  and  practice.  An  individual's  status  at  a  later  point  in 
time,  then,  may  be  qualitatively  different  from  his/her  status  as  originally 
measured.

Wohlwill  (1970)  discusses  this  issue  of  quantitative  versus  qualitative 
change  more  generally  in  the  area  of  developmental  psychology  and,  like  Garrett 
(1946),  calls  for  more  sophisticated  scaling  methods  which  will 

...  allow  us  to  assess  an  individual's  status  on  a  developmental  dimen¬ 
sion  in  a  manner  such  as  to  ensure  not  only  comparability  of  content 
for  the  different  parts  of  that  dimension,  but  at  the  same  time  a  con¬ 
tinuous  scale  along  which  developmental  change  can  be  charted  .... 
Postulating  a  unitary  dimension  across  the  age  span  under  investigation 
presupposes  that  there  are  no  major  discontinuities  in  the  development 
of the behavior in question, such as there obviously are in the assessment
of intelligence when we move from infancy to childhood. (p. 154)

Although Reinert (1970) called for the investigation of possible factor-structure
changes in areas other than intelligence and abilities more than a
decade  ago,  no  research  has  yet  extended  this  line  of  questioning  into  the  area 
of  classroom  achievement.  That  is,  there  have  been  no  reported  studies  that 
have  systematically  investigated  whether  the  individual  and  group  changes  that 
occur  after  classroom  instruction  or  program  participation  are  quantitative 
changes  in  the  level  of  achievement,  as  is  generally  assumed,  or  whether  more 
qualitative  changes  in  the  structure  of  the  achievement  variable  have  occurred. 

Kingsbury  and  Weiss  (1979)  studied  the  effects  of  testing  students  at  dif¬ 
ferent  points  in  instruction.  They  reported  that  the  single  factor  extracted 
from  the  item  responses  to  a  college  general  biology  examination  administered  on 
the  first  day  of  class  and  the  factor  extracted  from  the  item  responses  to  a 
classroom  midquarter  examination  differed  markedly  from  each  other  in  terms  of 
strength;  however,  they  could  not  further  investigate  the  similarity  of  the  fac¬ 
tor  pattern  loadings  from  both  administrations.  They  cautioned  that  replica¬ 
tions  of  their  findings  contrasting  the  pretest  factor  with  the  later  achieve¬ 
ment  factor  would  render  difference  scores  "completely  useless"  as  indicators  of 
achievement  level  growth,  since  different  variables  would,  in  fact,  be  measured 
at  the  two  points  in  time. 

The  importance  of  such  a  conclusion  should  not  be  underestimated.  If  dif¬ 
ferent  characteristics  are,  in  fact,  being  measured  at  two  different  occasions, 
then  the  computation  of  any  type  of  difference  score  is  inappropriate,  and  the 
evaluation  of  program  effectiveness  and  gains  in  individual  student  achievement 
must  be  made  on  some  other  basis.  It  is  justifiable  to  use  difference  scores 
(statistical  and  methodological  issues  notwithstanding)  only  when  it  can  be  dem¬ 
onstrated  that  quantitative  changes  are  the  only  changes  accompanying  instruc¬ 
tion. 

Purpose 


The  objectives  of  the  present  studies  were  to  investigate  the  nature  of  the 
changes  in  the  dimensionality  of  achievement  that  occurred  following  instruction 
in  two  different  achievement  domains — basic  mathematics  and  general  biology — and 
to determine the appropriateness of calculating difference scores in order to
measure  change  in  these  domains. 


STUDY  I 
Method 


Subjects  and  Tests 

Data were obtained from students enrolled in mathematics classes at the
University of Minnesota's General College during the fall quarter of 1979.
These students were administered a 35-item Arithmetic Placement Test (APT) on
the first day of class (pretest) and again as a final examination (posttest).
The APT is composed of five-alternative multiple-choice items covering such topics
as addition, subtraction, multiplication, and division of whole numbers,
fractions, decimals, and percents.

Item  responses  were  coded  as  correct,  incorrect,  or  missing  for  the  259 
students.  However,  only  136  of  the  students  answered  every  item  on  the  APT  on 
both  occasions,  i.e.,  123  students  omitted  or  did  not  reach  at  least  one  item  on 
either  occasion.  In  many  cases,  clusters  of  items  were  omitted  in  the  middle  of 
the  tests,  which  implied  that  students  were  omitting  the  groups  of  items  for 
which  they  did  not  know  the  answers,  rather  than  reaching  a  time  limit  for  the 
test.  To  deal  with  this  problem  of  missing  data,  a  15%-missing-data  criterion 
was  employed.  A  student's  response  protocol  was  deleted  from  the  data  set  if 
the  student  omitted  more  than  five  items  (i.e.,  15%  of  35  items)  on  either  the 
pretest  or  the  posttest.  This  resulted  in  a  group  of  220  students  on  which  all 
further  analyses  were  based.  For  these  220  students,  missing  data  were  coded  as 
incorrect  on  the  assumption  that  the  student  did  not  answer  the  item  because 
he/she  did  not  know  the  answer  and  was  unwilling  to  guess. 
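
The screening rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `MISSING` marker, the function name, and the list-of-pairs data layout are hypothetical and do not come from the original analysis.

```python
MISSING = None  # hypothetical code for an omitted or not-reached item


def apply_missing_data_criterion(protocols, max_fraction=0.15):
    """Drop students with too many omits on either occasion; score other omits as wrong.

    protocols: list of (pretest, posttest) pairs; each test is a list of
    1 (correct), 0 (incorrect), or MISSING.
    """
    kept = []
    for pretest, posttest in protocols:
        # Delete the protocol if either occasion exceeds the allowed share of omits
        # (for 35 items, 15% is 5.25, so six or more omits triggers deletion).
        too_many = any(
            sum(r is MISSING for r in test) > max_fraction * len(test)
            for test in (pretest, posttest)
        )
        if too_many:
            continue
        # Remaining omits are recoded as incorrect (0).
        kept.append(tuple([0 if r is MISSING else r for r in test]
                          for test in (pretest, posttest)))
    return kept
```

With 35 items, a student who omits five items on one occasion is retained and a student who omits six is dropped, which matches the "more than five items" wording of the criterion.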

Analyses 

Differences  in  achievement  level  estimates.  The  question  of  interest  with 
respect  to  achievement  level  estimates  was  whether  there  were  differences  in 
achievement  level  estimates  due  to  instruction,  i.e.,  were  students  growing  or 
gaining  in  achievement  levels  throughout  the  course  of  instruction?  Analyses 
pertinent to this question included comparisons of the frequency distributions
of number-correct scores both before and after instruction and a t test for the
difference  between  the  means  of  scores  on  the  pretest  and  the  posttest.  Compar¬ 
isons  were  also  made  of  the  distributions  of  item  difficulties  for  each  adminis¬ 
tration  of  the  APT.  The  correlation  between  scores  on  the  pretest  and  posttest 
was  computed  as  an  indication  of  the  degree  to  which  the  scores  were  linearly 
related.

Differences  in  the  structure  of  achievement.  A  related  but  less  often  in¬ 
vestigated  issue  is  whether  there  are  differences  in  the  structure  of  item  re¬ 
sponses  due  to  instruction.  Investigation  of  this  issue  involved  computing  and 
comparing  the  values  of  coefficient  alpha  as  an  index  of  internal  consistency, 
which  is  related  to  the  average  level  of  intercorrelation  of  the  items.  More 
germane  to  this  issue,  however,  was  whether  the  factor  structure  underlying  the 
test changed with instruction or whether it remained constant. Consequently,
principal axes factor analyses were performed separately on the pretest and
posttest  item  responses.  Pearson  product-moment  correlations  were  computed  be¬ 
tween  pairs  of  item  responses,  and  the  diagonal  elements  of  the  interitem  corre¬ 
lation  matrices  were  replaced  with  initial  estimates  of  the  communalities  of 
each  item,  as  given  by  the  squared  multiple  correlation  between  that  item  and 
the other items in the matrix. An iterative procedure for improving these communality
estimates was used, successively extracting factors and re-estimating
the communalities. This process continued until the difference between two successive
communality estimates was negligible (see Nie, Hull, Jenkins, Steinbrenner,
& Bent, 1975).
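
The iterative procedure can be illustrated with a minimal sketch restricted to a single factor. It is not the SPSS routine itself: the function names, the power-iteration eigen step, the fixed number of sweeps, and the convergence tolerance are all assumptions made for illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def first_factor(matrix, sweeps=200):
    """Loadings on the dominant factor, found by power iteration."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(sweeps):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in new) ** 0.5
        vec = [v / norm for v in new]
    # Rayleigh quotient gives the eigenvalue; loadings are eigenvector * sqrt(eigenvalue).
    eigenvalue = sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))
    return [v * eigenvalue ** 0.5 for v in vec]


def principal_axis_one_factor(corr, tol=1e-8, max_iter=1000):
    """Single-factor principal axes solution with iterated communalities."""
    n = len(corr)
    inverse = invert(corr)
    # Initial communalities: squared multiple correlation of each item with the rest.
    communalities = [1.0 - 1.0 / inverse[i][i] for i in range(n)]
    for _ in range(max_iter):
        # Reduced matrix: correlations with communalities on the diagonal.
        reduced = [[communalities[i] if i == j else corr[i][j] for j in range(n)]
                   for i in range(n)]
        loadings = first_factor(reduced)
        updated = [value * value for value in loadings]
        change = max(abs(u - c) for u, c in zip(updated, communalities))
        communalities = updated
        if change < tol:  # stop once successive estimates barely differ
            break
    return loadings
```

Applied to a correlation matrix generated exactly by one factor, the sketch should recover the generating loadings; a general-purpose routine would extract several factors with a full eigendecomposition instead.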

Random  sets  of  item  responses  were  generated  by  simulating  the  responses  of 
220  students  to  35  items  such  that  the  probability  of  a  correct  answer  by  any 
simulee  to  an  item  was  equal  to  the  difficulty  (proportion  correct)  of  that 
item.  This  was  done  separately  for  the  pretest  and  the  posttest.  Identical 
procedures  as  performed  for  the  real  data  were  carried  out  for  intercorrelating 
the  item  responses  and  factoring  the  resulting  matrix.  The  results  of  the  fac¬ 
tor  analyses  of  real  and  random  data  were  compared  to  determine  the  number  of 
"nonrandom"  factors  existing  in  the  real  data. 

The  final  factor  solutions  for  the  pretest  and  the  posttest  were  then  com¬ 
pared  in  terms  of  numbers  of  factors  extracted  and  the  similarities  between 
them.  Factor  similarity  was  evaluated  by  computing  the  root-mean-square  devia¬ 
tion,  the  product-moment  correlation  coefficient,  and  the  coefficient  of  congru¬ 
ence  between  the  factor  loadings  of  the  factors  extracted  at  each  test  adminis¬ 
tration  (see  Harman,  1976,  pp.  343-344).  These  similarity  measures  were  com¬ 
pared  with  values  obtained  from  the  two  sets  of  random  data,  as  recommended  by 
Nesselroade  and  Baltes  (1970). 
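
The random comparison data described in this section can be sketched as follows. The function name, the seed argument, and the use of Python's `random` module are assumptions; the original report does not say how its random responses were generated.

```python
import random


def simulate_random_responses(difficulties, n_simulees, seed=0):
    """Simulate independent 0/1 item responses.

    difficulties: proportion correct for each item (one value per item).
    Each simulee answers each item correctly with probability equal to
    that item's difficulty, independently of every other item.
    """
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in difficulties]
            for _ in range(n_simulees)]


# Hypothetical usage: 220 simulees, three items with made-up difficulties.
responses = simulate_random_responses([0.35, 0.64, 0.91], n_simulees=220)
print(len(responses), len(responses[0]))
```

Because the simulated items are independent by construction, any factor extracted from such data reflects chance alone, which is what makes it a baseline for judging the factors and similarity indices obtained from the real data.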

Results 

Differences  in  Achievement  Level  Estimates 

Total  score  differences.  Frequency  distributions  of  number-correct  scores 
for  both  administrations  of  the  APT  are  presented  in  Appendix  Table  A;  the  fre¬ 
quency  polygons  are  displayed  in  Figure  1.  This  figure  shows  that  although  the 
distribution  of  pretest  scores  was  approximately  symmetric,  the  distribution  of 
posttest  scores  was  negatively  skewed,  indicating  the  presence  of  a  ceiling  ef¬ 
fect.  Only  four  students  answered  all  35  items  correctly  on  the  posttest;  an 
additional 77 students (or 35%) incorrectly answered fewer than four items. The
mean  score  on  the  pretest  was  22.26,  the  median  was  22.74,  and  the  standard  de¬ 
viation  was  5.97.  For  the  posttest  these  statistics  were  28.91,  30.10,  and 
4.88, respectively. A one-tailed t test for the difference between the means of
dependent groups yielded t = 18.67, p < .0001.
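
As a rough check, the t statistic can be rebuilt from the summary statistics alone. The sketch assumes the standard formula for dependent groups and borrows the pretest-posttest correlation of .542 reported for these scores; the function name is hypothetical.

```python
import math


def paired_t_from_summary(mean_pre, sd_pre, mean_post, sd_post, r, n):
    """t statistic for the difference between means of dependent groups."""
    # Variance of the difference scores: var(post) + var(pre) - 2 * cov(pre, post)
    var_diff = sd_post ** 2 + sd_pre ** 2 - 2 * r * sd_pre * sd_post
    standard_error = math.sqrt(var_diff / n)
    return (mean_post - mean_pre) / standard_error


t = paired_t_from_summary(22.26, 5.97, 28.91, 4.88, r=0.542, n=220)
print(round(t, 2))
```

The result should land close to the reported 18.67; small discrepancies are expected because the inputs are rounded to two or three decimals.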

Item  difficulties.  The  differences  in  raw  score  distributions  observed 
between  pretest  and  posttest  were  mirrored  in  the  distributions  of  item  diffi¬ 
culties  for  the  two  administrations  of  the  APT,  as  shown  in  Table  1.  Although 
the  pretest  items  were,  on  the  average,  answered  correctly  more  often  than  not, 
nearly  a  third  of  them  (i.e.,  10  of  35)  were  answered  incorrectly  by  at  least 
half  of  the  students.  For  the  posttest,  however,  only  two  of  the  items  were  as 
difficult.  In  fact,  one  third  of  the  items  (12  of  35)  were  answered  correctly 
by  more  than  90%  of  the  students. 
[Figure 1. Grouped Frequency Distribution of Number-Correct Scores for APT Pretest and Posttest. Horizontal axis: Number-Correct Score.]


Table 1

Frequency Distributions of
Item Difficulties for APT
Administered as Pretest and as Posttest

Range of Item            Number of Items
Difficulty             Pretest    Posttest
.00 - .10                 0           0
.11 - .20                 1           0
.21 - .30                 1           0
.31 - .40                 4           0
.41 - .50                 4           2
.51 - .60                 5           0
.61 - .70                 5           3
.71 - .80                 6           9
.81 - .90                 5           9
.91 - 1.00                4          12
Mean Difficulty          .64         .83

Correlation between scores. The Pearson product-moment correlation coefficient
between number-correct scores at the two administrations of the APT was
.542.  This  relatively  low  value,  coupled  with  the  evidence  of  mean  score  in¬ 
creases,  reveals  that  students  did  not,  to  a  great  extent,  maintain  their  rela¬ 
tive  standings  in  the  course  after  instruction. 

Differences  in  the  Structure  of  Achievement 

Internal  consistency  reliability.  The  internal  consistency  reliability  of 
the  APT,  as  indexed  by  coefficient  alpha,  was  .836  for  the  pretest  and  .835  for 
the  posttest.  That  the  reliability  coefficient  remained  essentially  constant 
provides  some  evidence  for  concluding  that  the  items  were  functioning  together 
in  the  same  manner  before  and  after  instruction.  However,  since  the  variance  of 
the  scores  decreased  somewhat  from  pretest  to  posttest  (see  Appendix  Table  A), 
the  stability  of  coefficient  alpha  may  actually  reflect  a  slight  increase  in  the 
average  interitem  correlation. 
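
Coefficient alpha depends on the ratio of the summed item variances to the total-score variance, which is why a change in score variance matters when two alphas are compared. A minimal sketch, assuming a rows-are-students layout of 0/1 item scores (the function name is hypothetical):

```python
from statistics import variance


def coefficient_alpha(responses):
    """Coefficient alpha for a students-by-items matrix of item scores."""
    k = len(responses[0])  # number of items
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: three items that always agree, so alpha is at its maximum.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(coefficient_alpha(toy))
```

The toy matrix is made up for illustration; with the real 220-by-35 response matrices the same function would be applied once to the pretest and once to the posttest.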

Number  of  factors  extracted.  The  eigenvalues  and  percent  of  total  variance 
accounted  for  by  the  first  15  factors  from  the  APT  and  random  data  are  given  in 
Appendix  Table  B.  The  plots  of  eigenvalues  versus  factors  extracted  for  both 
the  APT  and  the  random  data  are  given  in  Figure  2a  for  the  pretest  and  in  Figure 
2b  for  the  posttest.  In  both  cases,  there  was  one  relatively  strong  factor  in 
the  data;  the  eigenvalue  for  the  first  factor  extracted  from  the  APT  was  much 
larger  than  the  eigenvalues  for  the  remaining  factors  in  the  APT  and  for  all  the 
factors  in  the  random  data.  The  same  cannot  be  said  for  any  of  the  remaining 
factors.  It  was  concluded  that  a  one-factor  solution  adequately  described  the 
item response data from both the pretest and the posttest. The FACTOR subroutine
in SPSS (Nie et al., 1975) was then run again on the data from each administration,
specifying a single-factor solution each time.

Factor  similarity.  The  factor  loadings  on  the  single  factor  extracted  from 
each  administration  of  the  APT  and  from  corresponding  random  data  are  given  in 
Table  2.  The  loadings  presented  in  Table  2  were  of  moderate  magnitude;  the  ma¬ 
jority  of  the  loadings  were  greater  than  .300,  but  all  were  less  than  .700.  The 
patterns  and  the  magnitudes  of  the  loadings  were  essentially  the  same  across 
test  administrations.  For  example,  Items  2  through  5  and  Item  28  were  among  the 
items  with  the  lowest  loadings  at  the  pretest;  the  same  was  true  for  these  items 
at  the  posttest.  The  items  with  the  highest  loadings  at  the  pretest  were  also 
among  the  items  with  the  highest  loadings  at  the  posttest.  That  the  magnitude 
of  the  loadings  was  similar  for  the  two  administrations  can  also  be  seen  by  com¬ 
paring  the  percentage  of  total  variance  accounted  for  by  each  factor.  The  sin¬ 
gle  factor  extracted  from  the  APT  pretest  data  accounted  for  13.92%  of  the  total 
variance  compared  to  3.05%  for  the  random  data.  The  factor  extracted  from  the 
APT  posttest  data  was  only  slightly  stronger,  accounting  for  14.59%  of  the  total 
variance  as  compared  to  2.40%  in  the  random  data. 

Table  3  presents  the  measures  of  factor  similarity  between  the  APT  factor 
loadings  at  pretest  and  at  posttest.  The  root-mean-square  deviation  between  the 
loadings  extracted  at  each  administration  is  sensitive  to  differences  in  the 
absolute  levels  of  the  loadings;  low  values  indicate  only  minor  differences  be¬ 
tween  the  values  of  the  two  sets  of  loadings.  The  root-mean-square  deviation 
was a low .089 for these data. The product-moment correlation coefficient is
sensitive only to differences in the patterns of the loadings and was equal to
.793. The coefficient of congruence is sensitive to differences in both the
level and the pattern of loadings and was a high .972. High values for these
latter two indices indicate a high degree of similarity between the two sets of
factor loadings. The three figures computed from the parallel random data were
.219, .067, and .118, respectively. It was concluded that the factors extracted
from each administration of the APT were nearly identical, both in nature and in
strength.

Table 2

Factor Loadings on the Single Factor
Extracted from APT at Pretest and at Posttest,
and from Corresponding Random Data

                         Pretest                 Posttest
Item                     APT Random Data         APT Random Data
1                       .289        .124        .303       -.042
2                       .088        .027       -.004        .130
3                       .058        .315        .152       -.049
4                       .160        .010        .219       -.051
5                       .191        .230        .226        .140
6                       .263       -.187        .255        .172
7                       .332       -.188        .118        .032
8                       .315        .147        .383        .036
9                       .156        .099        .341        .051
10                      .384        .150        .495       -.017
11                      .453       -.229        .253       -.277
12                      .372       -.178        .244       -.170
13                      .255        .007        .259       -.066
14                      .394        .345        .338        .136
15                      .376        .215        .440        .222
16                      .575       -.089        .545        .023
17                      .426        .075        .436       -.046
18                      .562       -.285        .484        .071
19                      .491       -.136        .440        .330
20                      .588        .109        .506        .135
21                      .580        .029        .676        .025
22                      .460        .185        .418        .212
23                      .344       -.200        .378        .319
24                      .370        .402        .433        .084
25                      .338       -.028        .500        .051
26                      .460        .108        .560        .005
27                      .357       -.074        .467       -.015
28                      .117        .044        .141        .054
29                      .495        .042        .481        .044
30                      .291        .162        .294        .196
31                      .292       -.276        .352        .006
32                      .378        .018        .386        .017
33                      .318        .084        .281        .195
34                      .313        .090        .359        .128
35                      .339        .153        .267       -.442
Percent of
Total Variance         13.92        3.05       14.59        2.40

Table 3

Measures of Factor Similarity Between
Factor Loadings of APT at Pretest
and at Posttest and Between Factor Loadings
for Corresponding Random Data

Similarity Index                        APT    Random Data
Root-Mean-Square Deviation             .089       .219
Pearson Product-Moment Correlation     .793       .067
Coefficient of Congruence              .972       .118
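
The three similarity indices can be recomputed from the APT loadings printed in Table 2. The sketch below uses the usual textbook definitions of each index, which are assumed to match those in Harman (1976); the function and variable names are hypothetical.

```python
import math

# APT loadings for Items 1-35, transcribed from Table 2.
PRETEST = [.289, .088, .058, .160, .191, .263, .332, .315, .156, .384,
           .453, .372, .255, .394, .376, .575, .426, .562, .491, .588,
           .580, .460, .344, .370, .338, .460, .357, .117, .495, .291,
           .292, .378, .318, .313, .339]
POSTTEST = [.303, -.004, .152, .219, .226, .255, .118, .383, .341, .495,
            .253, .244, .259, .338, .440, .545, .436, .484, .440, .506,
            .676, .418, .378, .433, .500, .560, .467, .141, .481, .294,
            .352, .386, .281, .359, .267]


def rms_deviation(a, b):
    """Root-mean-square deviation: sensitive to differences in level."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson_r(a, b):
    """Product-moment correlation: sensitive to differences in pattern only."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den


def congruence(a, b):
    """Coefficient of congruence: sensitive to both level and pattern."""
    num = sum(x * y for x, y in zip(a, b))
    return num / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


for index in (rms_deviation, pearson_r, congruence):
    print(index.__name__, round(index(PRETEST, POSTTEST), 3))
```

The printed values should agree, to rounding, with the APT column of Table 3. The correlation subtracts each set's mean before comparing loadings while the congruence coefficient does not, which is why the two can differ noticeably for the same pair of factors.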

Conclusions 

Differences  in  Achievement  Level  Estimates 

There  was  evidence  in  these  data  to  conclude  that  there  were  gains  in  mean 
achievement  levels  observed  after  a  course  of  instruction.  The  difference  be¬ 
tween  the  means  of  scores  on  the  35-item  pretest  and  posttest  was  nearly  7 
items;  the  frequency  distribution  of  number-correct  scores  changed  from  a  sym¬ 
metric  distribution  to  one  that  was  negatively  skewed  and  displaced  to  the 
right.  This  same  effect  was  mirrored  in  the  distributions  of  item  difficulties. 
The correlation between the two sets of number-correct scores was .542, indicating
that students did not generally maintain their relative standings in the
course  after  instruction.  It  is  not  known  to  what  extent  this  correlation  was 
attenuated  due  to  the  ceiling  effect  observed  for  the  posttest  scores. 

Differences  in  the  Structure  of  Achievement 

Although  there  was  definitive  evidence  of  mean  quantitative  change  from 
pretest  to  posttest,  there  was  no  evidence  of  qualitative  differences  in  the 
factor  structure  underlying  the  item  responses.  The  internal  consistency  reli¬ 
ability  of  the  test  remained  constant  across  administrations.  When  factor  anal¬ 
yses  were  performed  separately  on  the  pretest  and  posttest  interitem  correlation 
matrices,  essentially  the  same  factor  was  extracted  each  time,  as  evidenced  by 
the  similarity  in  the  levels  and  pattern  of  factor  loadings. 

These  data  indicate,  then,  that  students  in  the  General  College  arithmetic 
classes  were  indeed  leaving  the  course  with  increased  levels  of  the  same  vari¬ 
able  measured  prior  to  instruction.  The  change  that  occurred  within  the  quarter 
was  quantitative,  not  qualitative. 
STUDY II
Method 


Subjects 


Data  were  collected  from  students  enrolled  in  a  general  biology  class  at 
the  University  of  Minnesota  during  winter  quarter  of  1980.  A  paper-and-pencil 
pretest  was  administered  to  all  students  present  on  the  first  day  of  class. 
Computer-administered  conventional  posttests  were  given  before  classroom  mid¬ 
quarter  and  final  examinations  to  volunteer  students  who  were  awarded  extra¬ 
credit  points  for  their  participation. 

Design 


Tests. There were two different tests administered at various times
throughout  the  quarter.  Test  A  included  14  items  from  each  of  the  three  content 
areas  covered  in  class  lectures  before  the  midquarter  exam  (chemistry,  the  cell, 
and  energy).  Test  B  included  14  items  from  each  of  the  last  three  content  areas 
in the course (genetics, reproduction/embryology, and ecology).

Experimental  groups.  The  data  collection  design  for  this  study  is  shown  in 
Figure  3.  Students  were  randomly  assigned  to  two  experimental  groups,  Groups  1 
and  2,  corresponding  to  the  groups  of  students  who  were  administered  one  of  two 
pretests — Tests  A  or  B,  respectively — on  the  first  day  of  class.  Group  3  in¬ 
cluded  students  who  were  absent  for  the  first  class  meeting  or  who  did  not  re¬ 
cord  on  their  answer  sheet  which  test  they  took. 

[Figure 3. Data Collection Design for Study II. Final exam posttest: Group 1, Test A; Group 2, Test B; Group 3, Test B.]

During the two weeks immediately preceding the classroom midquarter examination,
volunteer students were administered conventional tests on the computer
(MQ posttest). All these students were administered Test A. During the two
weeks immediately preceding the final exam, volunteer students were administered
conventional  tests  on  the  computer  (final  exam  posttest).  Students  in  Group  1 
were  readministered  Test  A;  students  in  Groups  2  and  3  were  administered  Test  B. 

All  item  responses  were  coded  as  correct,  incorrect,  or  missing.  Missing 
or  omitted  items  did  not  present  an  important  problem  for  this  set  of  data. 
Nevertheless, the same 15%-missing-data criterion was used here as was used in
the  previous  study:  a  student's  response  protocol  was  deleted  from  the  data  set 
if  the  student  omitted  more  than  6  (i.e.,  15%  of  42)  items  on  any  one  test.  For 
the  students  included  in  the  analysis,  all  missing  data  were  coded  as  incorrect. 

Analyses 


Differences  in  achievement  level  estimates:  Test  A.  The  question  of 
whether  or  not  students'  achievement  level  estimates  on  Test  A  increased  from 
the  pretest  to  the  MQ  posttest  could  be  answered  by  examining  the  performance  of 
Group  1  students  on  Test  A  at  both  testing  occasions.  However,  the  number  of 
students  who  took  Test  A  both  times  was  small  (N  =  102)  compared  to  the  total 
number of students who took Test A at the pretest only (N = 276) and the total
number of students who took Test A at the MQ posttest only (N = 302). A more
powerful test of the difference in mean achievement levels could be performed by
combining the data from all students who took Test A at the MQ posttest and by
comparing their performance with that of all the students who took Test A as a
pretest.

For  this  comparison,  it  was  necessary  to  assume  that  the  three  groups  of 
students  being  combined  at  the  MQ  posttest  were  equivalent.  Group  1  students 
were  administered  Test  A  both  at  the  pretest  and  at  the  MQ  posttest.  (Although 
Test  A  was  also  administered  again  at  the  final  exam  posttest,  the  number  of 
Group  1  students  who  returned  to  take  Test  A  at  the  final  exam  posttest  was  too 
small  for  meaningful  comparisons  to  be  made.  Hence,  Test  A  analyses  were  con¬ 
fined  to  the  pretest  and  MQ  posttest  administrations.)  Performance  of  Group  1 
students  on  Test  A  at  the  MQ  posttest  can  be  attributed  to  the  students'  under¬ 
lying  ability,  to  the  classroom  instruction,  and/or  to  the  repetition  of  items 
from  one  occasion  to  the  next.  Group  2  students,  on  the  other  hand,  were  admin¬ 
istered  Test  B  as  the  pretest  and  were  administered  Test  A  for  the  first  time  at 
the  MQ  posttest.  Performance  of  Group  2  students  on  Test  A,  then,  could  be  at¬ 
tributed  only  to  the  students'  underlying  ability  and/or  to  the  classroom  in¬ 
struction.  For  some  Group  3  students  (those  who  were  absent  on  the  first  day  of 
class),  performance  on  Test  A  could  also  be  attributed  to  their  underlying  abil¬ 
ity  and/or  to  the  classroom  instruction  only.  For  the  other  Group  3  students 
(those  who  did  not  record  which  pretest  they  took),  however,  Test  A  performance 
could  be  attributed  to  their  underlying  ability,  to  the  classroom  instruction, 
and/or  to  item  repetition.  Since  these  two  subgroups  of  Group  3  students  could 
not  be  identified  and  separated  for  analysis,  however,  Group  3  was  omitted  from 
the  following  comparison  for  Test  A. 

Because  students  were  randomly  assigned  to  Groups  1  and  2  on  the  first  day 
of  class,  and  because  classroom  instruction  was  the  same  for  all  students,  any 
differences  observed  between  Groups  1  and  2  on  their  performance  on  Test  A  would 
reflect a repetition-of-items effect. If mean test scores of Groups 1 and 2
were  not  significantly  different  from  each  other,  then  Groups  1  and  2  could  be 
combined  at  the  MQ  posttest  and  compared  with  all  students  from  Group  1  at  the 
pretest.  If  a  significant  repetition-of-items  effect  were  found,  then  subse¬ 
quent  analyses  should  be  performed  only  on  the  data  from  those  students  in  Group 
1.  Differences  between  the  scores  of  Group  1  and  Group  2  students  were  evaluat¬ 
ed  by  the  use  of  a  t  test  for  the  difference  between  two  independent  groups  and 
by  the  Kolmogorov-Smirnov  two-sample  test  for  the  difference  between  two  fre¬ 
quency  distributions. 

Analyses  relevant  to  the  issue  of  differences  in  achievement  scores  includ¬ 
ed  examination  of  the  frequency  distributions  and  summary  statistics  of  num¬ 
ber-correct  scores  and  the  distributions  of  item  difficulties  from  the  pretest 
and  the  MQ  posttest. 

Differences  in  the  structure  of  achievement:  Test  A.  The  question  of 
whether  or  not  there  were  qualitative  changes  in  the  nature  of  achievement  test 
scores  due  to  instruction  was  again  investigated,  as  in  Study  I,  by  analysis  of 
internal  consistency  reliability  coefficients  and  by  separate  principal-axes 
factor  analyses.  These  analyses  were  performed  separately  on  the  pretest  and  MQ 
posttest  data  interitem  correlation  matrices,  with  communalities  estimated  using 
an  iterative  procedure,  as  described  in  Study  I.  The  number  of  nonrandom  fac¬ 
tors  was  again  determined  by  comparing  the  results  of  the  factor  analyses  of 
Test  A  data  with  the  results  of  factor  analyses  of  random  data  based  on  items  of 
similar  difficulty. 
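
The comparison with random data described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: it compares only the largest eigenvalue of the unreduced interitem correlation matrix, whereas the study used principal-axes factoring with iterated communalities. All function names and the simulated responses are hypothetical.

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlations between the columns of a list-of-rows score matrix."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    dev = [[r[j] - means[j] for j in range(k)] for r in rows]
    ss = [sum(d[j] ** 2 for d in dev) for j in range(k)]
    corr = [[1.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            denom = math.sqrt(ss[a] * ss[b])
            # Items with no variance are treated as uncorrelated with everything.
            r = sum(d[a] * d[b] for d in dev) / denom if denom > 0 else 0.0
            corr[a][b] = corr[b][a] = r
    return corr


def largest_eigenvalue(matrix, iterations=300):
    """Approximate the largest eigenvalue of a symmetric matrix by power iteration."""
    k = len(matrix)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
    return sum(v * p for v, p in zip(vec, product))  # Rayleigh quotient


def random_data_like(rows, rng):
    """Independent 0/1 responses whose item difficulties match the real items."""
    n, k = len(rows), len(rows[0])
    p = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[1 if rng.random() < p[j] else 0 for j in range(k)] for _ in range(n)]


def simulate_one_factor(n_persons, n_items, rng):
    """Hypothetical responses driven by a single ability, for demonstration only."""
    rows = []
    for _ in range(n_persons):
        ability = rng.gauss(0.0, 1.0)
        rows.append([1 if ability + rng.gauss(0.0, 0.5) > 0 else 0
                     for _ in range(n_items)])
    return rows


rng = random.Random(0)
real = simulate_one_factor(300, 10, rng)
baseline = random_data_like(real, rng)
real_first = largest_eigenvalue(correlation_matrix(real))
random_first = largest_eigenvalue(correlation_matrix(baseline))
# A factor counts as nonrandom only if it is stronger than its random counterpart.
print(real_first > random_first)
```

Matching the random items to the real item difficulties matters because the size of correlations between binary items depends on how far their difficulties are from .50.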

The  results  of  the  final  solutions  from  the  pretest  and  the  MQ  posttest 
were  then  compared  in  terms  of  the  numbers  of  factors  extracted  and  the  similar¬ 
ity  of  these  factors.  As  in  Study  I,  factor  similarity  was  indexed  by  the  root- 
mean-square  deviation,  the  product-moment  correlation  coefficient,  and  the  coef¬ 
ficient  of  congruence  between  the  factor  loadings  obtained  at  each  occasion  in 
comparison  with  values  obtained  from  two  sets  of  random  data. 
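
The three similarity indices named above can be sketched as follows, assuming their usual textbook definitions (the study's own formulas are given in Study I, not here). The function names and the three-item loadings are hypothetical and are not taken from the study.

```python
import math


def rms_deviation(a, b):
    """Root-mean-square deviation: sensitive to differences in loading levels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson_r(a, b):
    """Product-moment correlation: sensitive to differences in loading pattern."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


def congruence(a, b):
    """Coefficient of congruence: cross-products scaled by the sums of squares."""
    cross = sum(x * y for x, y in zip(a, b))
    return cross / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# Hypothetical loadings for three items on two occasions (not the study data).
pretest = [0.30, 0.40, 0.10]
posttest = [0.40, 0.30, 0.10]
print(round(rms_deviation(pretest, posttest), 3))
print(round(pearson_r(pretest, posttest), 3))
print(round(congruence(pretest, posttest), 3))
```

Because the coefficient of congruence is not centered on the mean loading, two sets of mostly positive loadings can show a fairly high congruence alongside a much lower correlation. This is one reason the indices are judged against values obtained from random data.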

Differences  in  achievement  level  estimates:  Test  B.  The  question  of 
whether  or  not  students'  achievement  level  estimates  on  Test  B  increased  from 
the  pretest  to  the  final  exam  posttest  could  be  answered  by  examining  the  per¬ 
formance  of  Group  2  students  on  Test  B  at  both  testing  occasions.  However,  if 
no significant repetition-of-items effect was found for Test A (as discussed
above),  the  assumption  could  be  made  that  there  would  be  no  repetition-of-items 
effect  for  Test  B;  then  there  would  be  justification  for  combining  the  data  on 
Test  B  from  Groups  2  and  3  at  the  final  exam  in  order  to  conduct  a  more  powerful 
test  of  the  difference  between  mean  achievement  level  estimates.  Analyses  rele¬ 
vant  to  this  question  included  examination  of  the  frequency  distributions  and 
summary  statistics  of  number-correct  scores,  and  the  distributions  of  item  dif¬ 
ficulties  from  the  pretest  and  the  final  exam  posttest. 

Differences  in  the  structure  of  achievement:  Test  B.  As  described  above, 
the  internal  consistency  reliability  coefficient  (coefficient  alpha)  was  comput¬ 
ed  for  Test  B  at  the  pretest  and  at  the  final  exam  posttest.  Separate  principal 
axes  factor  analyses  were  also  performed  on  the  Test  B  data  and  on  parallel  ran¬ 
dom  data.  The  final  factor  solutions  of  Test  B  from  the  pretest  and  the  final 
exam  posttest  were  also  compared  in  terms  of  the  number  of  factors  extracted  and 
the  similarity  of  these  factors,  as  was  done  in  Study  I  and  for  Test  A  in  this 
study.


Results


Effect  of  Item  Repetition 

The  effect  on  achievement  level  estimates  of  repeating  items  from  the  pre¬ 
test  to  a  posttest  was  evaluated  by  comparing  the  performance  of  students  in 
Groups  1  and  2  on  Test  A  administered  before  the  midquarter  exam  (MQ  posttest). 
There  were  102  students  from  Group  1  who  volunteered  to  take  the  MQ  posttest,  of 
which  98  met  the  15%-missing-data  criterion  and  were  retained  for  analyses.  For 
Group  2  these  figures  were  101  and  91,  respectively. 

Appendix  Table  C  presents  the  frequency  distributions  of  number-correct 
scores  for  Test  A  administered  at  the  MQ  posttest  to  students  from  Groups  1  and 
2;  the  frequency  polygons  are  displayed  in  Figure  4.  For  Group  1  the  mean  test 
score  was  24.19,  the  median  was  23.79,  and  the  standard  deviation  was  5.87.  For 
Group 2 these statistics were 22.59, 21.80, and 6.26, respectively. The t
statistic for the difference between the means of the independent groups was
1.98; this was not statistically significant at p = .01. The entire frequency
distributions of Groups 1 and 2 were compared by using a Kolmogorov-Smirnov
two-sample test; the statistic calculated was equal to 7.86, which was not
statistically significant at p = .01.
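
As a rough illustration of the two-sample comparison, the sketch below computes the Kolmogorov-Smirnov D statistic, the largest gap between two cumulative distributions. The text does not state how the reported statistic (7.86) was scaled, so the sketch stops at D and makes no attempt to reproduce that value. The function name and the scores are hypothetical.

```python
def ks_two_sample_d(x, y):
    """Largest absolute gap between the two empirical cumulative distributions."""
    d = 0.0
    for point in sorted(set(x) | set(y)):
        fx = sum(1 for v in x if v <= point) / len(x)
        fy = sum(1 for v in y if v <= point) / len(y)
        d = max(d, abs(fx - fy))
    return d


# Hypothetical number-correct scores for two small groups (not the study data).
group_1 = [20, 22, 24, 26]
group_2 = [24, 26, 28, 30]
print(ks_two_sample_d(group_1, group_2))
```

Unlike the t test, which compares only the means, this statistic responds to any difference in the shape or location of the two distributions.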


Figure 4

Grouped Frequency Distributions of Number-Correct Scores
for Biology Test A Administered at MQ Posttest
for Groups 1 and 2

[Frequency polygons; horizontal axis: Number-Correct Score]


Although  the  observed  differences  were  in  the  predicted  direction,  the  ef¬ 
fect  of  item  repetition  was  not  statistically  significant.  Hence,  the  question 
of  identifying  and  separating  the  two  subgroups  of  Group  3  was  no  longer  rele¬ 
vant,  and  the  Test  A  MQ  posttest  scores  of  students  in  Groups  1,  2,  and  3  were 
combined  for  comparison  with  the  scores  of  all  students  who  took  Test  A  on  the 
first day of class. Since some of the students who took the test at the pretest
did not take it at the posttest, the correlation between scores at pretest and
posttest was not computed.

Missing  Data 


There  were  276  students  who  were  administered  Test  A  at  the  pretest;  of 
these  272  met  the  15%-missing-data  criterion  and  were  retained  for  further  anal¬ 
yses.  The  combined  total  of  students  who  took  Test  A  at  the  MQ  posttest  was 
302,  and  283  of  these  were  retained  for  further  analyses. 

Because  there  was  no  effect  of  item  repetition  observed  for  Test  A,  the 
performance  of  Group  2  students  who  were  administered  Test  B  at  the  pretest  was 
compared  with  the  performance  of  students  from  both  Groups  2  and  3  who  were  ad¬ 
ministered  Test  B  at  the  final  exam  posttest.  There  were  283  students  who  were 
administered  Test  B  at  the  pretest,  of  which  277  met  the  15%-missing-data  crite¬ 
rion  and  were  retained  for  further  analyses.  A  total  of  169  students  took  Test 
B  at  the  final  exam  posttest,  and  163  of  them  were  retained  for  further  analy¬ 
ses. 

Differences  in  Achievement  Level  Estimates:  Test  A 


Total  score  differences.  Frequency  distributions  of  number-correct  scores 
on  Test  A  at  both  testing  occasions  are  presented  in  Appendix  Table  D;  the  fre¬ 
quency  polygons  appear  in  Figure  5.  Both  distributions  are  approximately  sym¬ 
metric,  with  the  distribution  of  MQ  posttest  scores  displaced  to  the  right.  The 
mean  of  the  pretest  scores  was  15.97,  with  a  standard  deviation  of  3.97.  For 
the  MQ  posttest  scores,  these  figures  were  23.46  and  5.99,  respectively.  The 
mean  score  difference  between  the  two  occasions  was  7.49.  Because  there  was 
some overlap between the students in the two groups, the groups were neither
strictly independent nor strictly dependent. A t test for the difference
between two independent means, although technically inappropriate, would yield a
conservative test of the significance of this difference. This test resulted in
t (df = 553) = 17.34, p < .001.
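
The independent-groups t statistic can be approximated from the summary statistics reported above. The sketch assumes the usual pooled-variance formula and standard deviations computed with n - 1 in the denominator; neither detail is stated in the text, and the helper name `pooled_t` is hypothetical.

```python
import math


def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Independent-groups t statistic and df from summary statistics."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / se, df


# MQ posttest versus pretest on Test A, using the rounded values reported above.
t_value, df = pooled_t(23.46, 5.99, 283, 15.97, 3.97, 272)
print(round(t_value, 2), df)
```

With these rounded inputs the statistic comes out at about 17.3 on 553 degrees of freedom, close to the reported 17.34; the small gap presumably reflects rounding of the means and standard deviations.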

Item  difficulties.  The  frequency  distributions  of  item  difficulties  for 
Test  A  at  both  testing  occasions  are  given  in  Table  4.  As  indicated  earlier, 
the  pretest  was  somewhat  difficult:  74%  of  the  items  were  answered  correctly  by 
less  than  half  the  students,  and  no  item  was  answered  correctly  more  than  80%  of 
the  time.  After  instruction,  more  than  half  the  items  (23  of  42)  were  answered 
correctly  by  51%  to  90%  of  the  students,  although  five  items  were  answered  cor¬ 
rectly  less  than  30%  of  the  time. 
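
The percentages quoted in this paragraph can be checked directly against the item counts in Table 4. The sketch below is only an arithmetic check; it treats the .41 - .50 band as "less than half," which is an assumption about how the boundary was handled.

```python
# Item counts per difficulty band, copied from Table 4
# (bands .00-.10, .11-.20, ..., .91-1.00).
pretest_counts = [1, 8, 8, 9, 5, 4, 2, 5, 0, 0]
posttest_counts = [1, 1, 3, 7, 7, 5, 5, 5, 8, 0]

n_items = sum(pretest_counts)                # total number of items
below_half_pre = sum(pretest_counts[:5])     # pretest bands up to .50
mid_high_post = sum(posttest_counts[5:9])    # posttest bands .51 through .90
hard_post = sum(posttest_counts[:3])         # posttest bands up to .30

print(n_items, round(100 * below_half_pre / n_items), mid_high_post, hard_post)
# prints: 42 74 23 5
```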

Differences  in  the  Structure  of  Achievement:  Test  A 


Internal consistency reliability. Coefficient alpha for Test A when
administered on the first day of class was .490. This low value indicates that
the average interitem correlation was correspondingly small. After instruction,
coefficient alpha increased to .787 for the same set of items. Although this
value is not high for a 42-item test, it represents a substantial increase over
the value obtained at the pretest. The difference between these two figures may
indicate that the items were functioning as a set differently after instruction
than they were before instruction and/or it may reflect the increase in the
variance of the number-correct scores.
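
To make the link between coefficient alpha and the average interitem correlation concrete, the sketch below inverts the standardized-alpha (Spearman-Brown) relation for a 42-item test. This is an approximation: it assumes roughly equal item variances, which the text does not claim, and the function names are hypothetical.

```python
def implied_mean_r(alpha, n_items):
    """Mean interitem correlation implied by standardized alpha for n_items items."""
    return alpha / (n_items - alpha * (n_items - 1))


def standardized_alpha(mean_r, n_items):
    """Standardized alpha for n_items items with mean interitem correlation mean_r."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


for label, alpha in (("pretest", 0.490), ("MQ posttest", 0.787)):
    print(label, round(implied_mean_r(alpha, 42), 3))
```

Under this approximation the implied average interitem correlation rises from roughly .02 at the pretest to roughly .08 at the MQ posttest, which shows how small the correlations remain even when alpha approaches .79 on a test of this length.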


Figure 5

Grouped Frequency Distributions of Number-Correct Scores
for Biology Test A Administered at Pretest and at MQ Posttest

[Frequency polygons; vertical axis: Frequency]


Table 4

Frequency Distributions of Item
Difficulties for Biology Test A
Administered at Pretest
and at MQ Posttest

Range of Item          Number of Items
Difficulty            Pretest    Posttest

 .00 -  .10              1           1
 .11 -  .20              8           1
 .21 -  .30              8           3
 .31 -  .40              9           7
 .41 -  .50              5           7
 .51 -  .60              4           5
 .61 -  .70              2           5
 .71 -  .80              5           5
 .81 -  .90              0           8
 .91 - 1.00              0           0

Mean Difficulty         .38         .56

Number of factors extracted. Appendix Table E presents the eigenvalues and
percent  of  total  variance  accounted  for  by  the  first  15  factors  from  Test  A  and 
from  corresponding  random  data.  Figure  6a  presents  the  plots  of  eigenvalues 
versus  factors  extracted  from  Test  A  and  from  random  data  at  the  pretest,  and 
Figure  6b  presents  results  for  the  MQ  posttest.  Comparison  of  the  results  from 
Test  A  with  the  results  from  the  corresponding  random  data  revealed  that  there 
was  one  weak  factor  present  in  the  pretest  and  one  stronger  factor  present  in 
the  posttest. 

Factor  similarity.  Table  5  presents  the  factor  loadings  on  the  single  fac¬ 
tor  extracted  at  each  testing  occasion  from  Test  A  and  from  corresponding  random 
data.  Comparison  of  these  factor  loadings  reveals  that  the  loadings  from  the  MQ 
posttest  were,  in  general,  higher  than  those  from  the  pretest.  No  loading  from 
the  pretest  was  greater  than  .391,  and  nearly  two-thirds  of  the  factor  loadings 
(26  of  42)  were  less  than  .200.  For  the  MQ  posttest,  the  highest  loading  was 
.502,  but  81%  of  the  factor  loadings  (34  of  42)  were  greater  than  .200. 

This  result  can  also  be  seen  by  comparing  the  percentages  of  total  variance 
accounted  for  by  the  single  factor  at  each  administration.  For  the  pretest  that 
figure  was  3.96%  (as  compared  to  2.88%  for  the  random  data);  for  the  MQ  posttest 
the  factor  accounted  for  9.36%  of  the  total  variance  (as  compared  to  2.79%  for 
the  random  data).  Both  of  these  percentages  are  small  for  a  42-item  test,  indi¬ 
cating  that  the  factor  was  relatively  weak,  even  at  the  MQ  posttest. 
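
The percentages above can be related to eigenvalues with a short calculation. The sketch assumes that total variance equals the number of items (42 standardized items), so that a factor's percent of total variance is its eigenvalue, or sum of squared loadings, divided by 42. The function names are hypothetical.

```python
def percent_of_total_variance(eigenvalue, n_items):
    """Share of total standardized item variance carried by one factor, in percent."""
    return 100.0 * eigenvalue / n_items


def eigenvalue_from_percent(percent, n_items):
    """Invert the relation to recover the eigenvalue (sum of squared loadings)."""
    return percent * n_items / 100.0


for label, pct in (("pretest", 3.96), ("MQ posttest", 9.36)):
    print(label, round(eigenvalue_from_percent(pct, 42), 2))
```

Under this assumption the MQ posttest factor corresponds to an eigenvalue of about 3.93, which agrees with the sum of the squared posttest loadings listed in Table 5. The APT figures in Appendix Table B follow the same relation if that test is taken to have 35 items, which is also an assumption here.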

The  pattern  of  factor  loadings  did  not  appear  to  be  consistent  across  test 
administrations.  The  items  with  the  lowest  loadings  at  the  pretest  did  not 
emerge  as  the  items  with  the  lowest  loadings  at  the  MQ  posttest,  and  the  same 
was  true  for  the  items  with  the  highest  loadings. 

Table  6  presents  the  measures  of  factor  similarity  between  the  two  sets  of 
loadings  for  Test  A  and  the  corresponding  random  data.  The  root-mean-square 
deviation  between  the  two  sets  of  loadings  for  Test  A,  sensitive  to  differences 
in  levels  of  the  loadings,  was  .195,  a  high  value  when  considered  in  conjunction 
with  the  relatively  narrow  range  of  loadings  observed  in  these  data.  The  prod¬ 
uct-moment  correlation  coefficient  between  the  loadings,  sensitive  to  pattern 
differences,  was  a  low  .373.  The  coefficient  of  congruence  was  .780.  The  simi¬ 
larity  measures  obtained  from  the  random  data  were  .160,  .549,  and  .548,  respec¬ 
tively.  All  these  figures  reveal  that  the  factors  extracted  from  Test  A  on  the 
two  occasions  were  not  substantially  more  similar  than  were  factors  extracted 
from  randomly  generated  data. 

These  data  reveal,  then,  that  the  factor  extracted  from  Test  A  at  the  pre¬ 
test  differed  substantially  from  that  extracted  at  the  MQ  posttest.  Although 
there  was  a  sizeable  increase  in  the  number-correct  scores  after  instruction, 
there  was  a  corresponding  change  in  the  first  factor  underlying  the  item  respon¬ 
ses.  This  indicates  that  the  pretest  and  the  MQ  posttest  measured  quite  differ¬ 
ent  variables,  even  though  they  were  composed  of  exactly  the  same  items. 

Table 5

Factor Loadings on the Single Factor Extracted
from Biology Test A at Pretest and at MQ Posttest,
and from Corresponding Random Data

               Pretest                  Posttest
Item     Test A   Random Data     Test A   Random Data

  1       .068      -.032          .186       .158
  2       .024      -.026          .133      -.205
  3       .331      -.245          .161       .051
  4       .115       .163          .279       .150
  5      -.002      -.238          .276      -.099
  6       .206      -.054          .008       .029
  7       .280       .191          .372       .121
  8       .191      -.246          .333      -.153
  9       .272       .096          .408       .120
 10       .027      -.005          .367      -.002
 11       .291      -.163          .154      -.154
 12       .103      -.035          .207       .011
 13       .370       .327          .502       .208
 14       .391      -.197          .344      -.223
 15       .042       .440          .388       .418
 16       .273      -.010          .341       .296
 17       .133      -.042          .335       .079
 18       .239      -.105          .310      -.162
 19       .388       .021          .276       .162
 20       .205       .362          .410       .222
 21       .115      -.059          .316      -.098
 22       .223      -.040          .479      -.161
 23       .383       .060          .298       .024
 24       .245       .067          .373      -.114
 25       .052      -.053          .228       .187
 26      -.024      -.116          .246      -.105
 27       .039       .091          .478       .083
 28       .015      -.094          .143       .060
 29       .117       .061          .315       .244
 30       .343      -.139          .372      -.224
 31       .095       .070          .200       .057
 32       .194      -.027          .284      -.154
 33       .043       .179          .272       .255
 34       .059      -.050          .249       .337
 35       .096      -.150          .301       .190
 36      -.026       .148          .245       .206
 37       .221      -.139          .340      -.021
 38       .107      -.185          .227      -.095
 39       .106       .282          .241      -.016
 40      -.111      -.344         -.030       .077
 41      -.124       .162          .164      -.041
 42       .063       .113          .422       .117

Percent of
Total Variance  3.96   2.88        9.36       2.79


Table 6

Measures of Factor Similarity Between Factor
Loadings for Test A at Pretest and at MQ
Posttest, and Between Factor Loadings
from Corresponding Random Data

Similarity Index                       Test A    Random Data

Root-Mean-Square Deviation              .195        .160
Pearson Product-Moment Correlation      .373        .549
Coefficient of Congruence               .780        .548


Differences in Achievement Level Estimates: Test B


Total score differences. Frequency distributions of number-correct scores
on Test B at both testing occasions are given in Appendix Table F; their
frequency polygons are presented in Figure 7. The distribution of final exam
posttest scores is approximately symmetric, while that of the pretest scores is
slightly positively skewed. The mean of the pretest scores was 15.18, with
standard deviation 3.54. For the final exam posttest scores, these figures were
21.47 and 4.58, respectively. The difference between the mean scores on the two
occasions was 6.29. As before, a t test for the difference between two
independent means, though technically inappropriate, was conducted as a
conservative test of this difference; here, t (df = 438) = 16.15, p < .001.

Figure  7 

Grouped  Relative  Frequency  Distributions  of  Number-Correct  Scores 
for  Biology  Test  B  Administered  at  Pretest  and  at  Final  Exam  Posttest 


Item  difficulties.  The  frequency  distributions  of  item  difficulties  for 
Test B at both testing occasions are given in Table 7. As was observed for the
number-correct scores, the pattern of item difficulties reveals that the pretest
was  somewhat  difficult:  74%  of  the  items  were  answered  correctly  by  less  than 
half  the  students,  and  only  two  items  were  answered  correctly  more  than  80%  of 
the  time.  At  the  end  of  the  course,  more  than  half  the  items  (22  of  42)  were 
answered  correctly  by  the  majority  of  students,  although  12  items  were  answered 
correctly  less  than  30%  of  the  time. 


Table 7

Frequency Distributions of Item
Difficulties for Biology Test B
Administered at Pretest and
at Final Exam Posttest

Range of Item          Number of Items
Difficulty            Pretest    Posttest

 .00 -  .10              4           2
 .11 -  .20              9           3
 .21 -  .30              8           7
 .31 -  .40              3           4
 .41 -  .50              7           4
 .51 -  .60              5           2
 .61 -  .70              2          10
 .71 -  .80              2           5
 .81 -  .90              2           4
 .91 - 1.00              0           1

Mean Difficulty         .36         .51

Differences  in  the  Structure  of  Achievement:  Test  B 


Internal  consistency  reliability.  When  administered  at  the  pretest  on  the 
first  day  of  class,  coefficient  alpha  for  Test  B  was  .398,  increasing  to  .630 
when  administered  at  the  final  exam  posttest.  These  low  values  indicate  that 
the  average  interitem  correlation  coefficient  was  correspondingly  small.  Even 
though  both  reliability  coefficients  were  relatively  low,  the  fact  that  the  re¬ 
liability  coefficient  increased  from  .40  to  .63  may  be  an  indication  that  the 
items  were  functioning  as  a  set  differently  after  instruction  than  they  were 
before  instruction.  As  before,  however,  this  increase  may  simply  be  reflecting 
the  increase  in  the  variance  of  the  test  scores. 

Number  of  factors  extracted.  Appendix  Table  G  presents  the  eigenvalues  and 
percentages  of  total  variance  accounted  for  by  the  first  15  factors  extracted 
from  Test  B  and  from  corresponding  random  data.  Figure  8a  presents  the  plots  of 
these  eigenvalues  versus  factors  extracted  at  the  pretest,  and  Figure  8b  pre¬ 
sents  similar  data  from  the  final  exam  posttest.  Comparison  of  the  results  from 
the real data with the results from the random data reveals that there was no
factor  stronger  than  one  extracted  from  the  random  data  in  the  pretest,  but  one 
stronger  factor  was  extracted  from  Test  B  at  the  final  exam  posttest. 

Figure 8

Eigenvalues for the First 15 Factors Extracted from Biology Test B
Administered at Pretest and at Final Exam Posttest,
and from Corresponding Random Data

[Eigenvalue plots; panel (a) Pretest]


Factor similarity. Table 8 presents the factor loadings on the single
factor extracted at each testing occasion from Test B and from corresponding
random data. Comparison of these factor loadings reveals that the loadings from
the final exam posttest were, in general, slightly higher than those from the
pretest. The highest pretest loading was .459, and nearly two-thirds of the
factor loadings (27 of 42) were less than .200. For the final exam posttest, the
highest loading was .438, but more than half of the factor loadings (23 of 42)
were greater than .200.


Table 8

Factor Loadings on the Single Factor Extracted
from Biology Test B at Pretest and at Final Exam Posttest
and from Corresponding Random Data

               Pretest                  Posttest
Item     Test B   Random Data     Test B   Random Data

  1       .131       .088          .295      -.044
  2       .073       .087          .310       .377
  3      -.023      -.168          .193       .258
  4       .218       .122          .416       .098
  5       .252      -.286          .137       .113
  6       .268       .145          .240       .179
  7       .191       .145          .256      -.236
  8       .127      -.113          .296       .246
  9      -.044       .293          .273      -.066
 10       .323      -.320          .255       .296
 11       .193       .471          .202       .060
 12       .164       .117          .311      -.239
 13       .393      -.111          .371       .161
 14      -.007      -.136          .438       .030
 15       .228      -.085          .261       .045
 16       .329      -.099          .301       .284
 17       .246      -.252          .310       .193
 18       .154       .381          .372      -.073
 19       .192      -.098          .241       .006
 20      -.027       .341          .193      -.013
 21       .231      -.151          .307       .092
 22      -.239      -.156          .268       .411
 23       .459       .213          .299       .162
 24       .062       .067          .079       .140
 25       .009       .182          .330      -.037
 26       .045      -.101          .174      -.044
 27      -.101       .034         -.112      -.057
 28       .130      -.080          .043       .112
 29       .296      -.245          .084       .088
 30       .215       .077          .155       .328
 31       .252       .179          .397       .003
 32       .278       .020          .177      -.123
 33      -.045       .045         -.112      -.082
 34       .028      -.277          .137       .003
 35       .012       .384          .165       .093
 36       .166      -.012         -.071       .047
 37      -.115      -.034         -.023      -.026
 38       .018       .060         -.002       .009
 39       .082       .120          .011       .053
 40       .040       .109          .178      -.088
 41       .013      -.457          .205      -.015
 42      -.058       .510         -.111      -.071

Percent of
Total Variance  3.69   4.70        5.96       2.54

This  result  can  also  be  seen  by  comparing  the  percentage  of  total  variance 
accounted  for  by  the  single  factor  extracted  at  each  administration.  For  the 
pretest,  that  figure  was  3.69%  (as  compared  to  4.70%  accounted  for  by  the  random 
factor);  for  the  final  exam  posttest,  the  factor  accounted  for  5.96%  of  the  to¬ 
tal  variance  (as  compared  to  2.54%  for  the  random  data).  Both  of  these  percent¬ 
ages  are  very  small,  indicating  that  the  factor  was  relatively  weak. 

The  pattern  of  factor  loadings  did  not  appear  consistent  across  test  admin¬ 
istrations.  The  items  with  the  lowest  loadings  at  the  pretest  did  not  necessar¬ 
ily  emerge  as  the  items  with  the  lowest  loadings  at  the  final  exam  posttest,  and 
the  same  was  true  for  the  items  with  the  highest  loadings. 

Table 9 presents the measures of factor similarity for Test B. The
root-mean-square deviation between the two sets of loadings for Test B,
sensitive to differences in levels of the loadings, was .177, a high value when
considered in conjunction with the relatively narrow range of loadings observed
in these data but lower than the .300 observed for the two sets of random data.
The product-moment correlation coefficient between the loadings, sensitive to
pattern differences, was a low .399, as contrasted with r = -.327 for the random
data. The coefficient of congruence was .697 for Test B and -.255 for the random
data. Although the comparison of the similarity measures reveals that the factor
loadings for Test B were more congruent than the corresponding sets of random
data, the degree of similarity was so low that these factors could not
justifiably be considered congruent.


Table 9

Measures of Factor Similarity Between Factor
Loadings from Test B at Pretest and at Final
Exam Posttest, and Between Factor Loadings
from Corresponding Random Data

Similarity Index                       Test B    Random Data

Root-Mean-Square Deviation              .177        .300
Pearson Product-Moment Correlation      .399       -.327
Coefficient of Congruence               .696       -.255

These  data  reveal,  then,  that  the  factor  extracted  from  Test  B  at  the  pre¬ 
test  differed  from  the  factor  extracted  at  posttest.  As  was  observed  for  Test 
A,  there  was  a  sizeable  increase  in  the  number-correct  scores,  accompanied  by  a 
change  in  the  factor  underlying  the  item  responses.  This  indicates  that  the 
pretest  and  the  final  exam  posttest  were  measuring  quite  different  variables, 
even though they were composed of exactly the same items.


Conclusions


Differences  in  Achievement  Level  Estimates 


The results from both Test A and Test B indicate that there were mean
differences in achievement level estimates (number-correct scores) that
accompanied classroom instruction. On the average, test scores increased after
relevant course instruction; for these data, scores increased between 6 and 7.5
points on a 42-item test. The increases in these test scores were not
attributable to the effect of item repetition. Although the differences were in
the predicted direction, neither the t test nor the Kolmogorov-Smirnov
two-sample test was significant at p = .01.

Differences  in  the  Structure  of  Achievement 


There were substantial differences in the structure of item responses to
the items on both biology tests (Test A and Test B) from the pretest to the
posttest. Large increases in the internal consistency reliability coefficient
may  reflect  corresponding  changes  in  the  average  interitem  correlation  coeffi¬ 
cients.  That  is,  changes  in  the  way  the  items  functioned  together  as  a  set  were 
evident  after  instruction  took  place.  This  same  effect  was  observed  when  the 
factor  structures  of  the  tests  at  both  administrations  were  compared.  Although 
only  one  factor  was  extracted  at  each  administration  of  each  test,  the  factor  at 
each  pretest  was  very  weak  and  bore  little  relationship  to  the  factor  extracted 
later  in  the  course,  as  reflected  in  the  patterns  and  levels  of  the  factor  load¬ 
ings. 


DISCUSSION  AND  CONCLUSIONS 


The  results  of  these  studies  show  that  the  use  of  simple  difference  scores 
to  measure  change  in  classroom  achievement  may  not  be  appropriate  for  all  sub¬ 
ject  matter  areas.  The  use  of  simple  difference  scores,  or  some  derivative 
thereof,  assumes  that  there  is  only  a  quantitative  difference  between  pretest 
and  posttest  achievement  levels  due  to  a  course  of  instruction.  That  is,  the 
assumption  is  made  that  a  pretest  measures  a  baseline  amount  of  some  knowledge 
or  trait  and  that  classroom  instruction  results  in  increased  levels  of  the  same 
trait,  as  indicated  by  higher  scores  on  the  same,  or  a  similar,  test. 

This assumption was supported by the results of the mathematics data.
There was a large and statistically significant difference observed in
achievement test scores obtained before and after instruction. That the same
trait was being measured both times was indicated by the high degree of
similarity of the underlying factor structure of the test when examined at both
points in time. The only change observed in the mathematics test scores was,
then, a quantitative one, reflected in increases in mean number-correct score
after classroom instruction in mathematics.

The  results  were  quite  different  for  the  two  biology  tests  examined.  Fac¬ 
tor  analyses  of  the  pretests  revealed  the  presence  of  one  very  weak  factor  for 
each pretest. One slightly stronger factor also emerged at each of the
posttests, but there was very little correspondence between the pretest and
posttest factors. Even though mean test scores increased after instruction,
there was a
corresponding  difference  in  the  factors  underlying  test  performance.  The  change 
that  occurred  in  the  biology  test  scores,  then,  was  a  qualitative  one,  where  the 
tests  were  measuring  different  variables  before  and  after  instruction.  Evaluat¬ 
ing  gains  in  achievement  by  computing  pretest-posttest  difference  scores  cannot 
be  justified  under  these  circumstances. 

That  the  results  from  these  two  studies  are  different  has  important  bearing 
on  the  issue  of  program  evaluation  and  the  measurement  of  change.  The  question 
of  whether  the  difference  in  test  scores  that  follows  classroom  instruction  or 
program  participation  is  quantitative  or  qualitative  must  be  answered  before  any 
attempt  at  quantifying  change  can  legitimately  be  made.  For  some  courses  of 
instruction,  the  application  of  classical  change-score  methodology  may  be  de¬ 
fended  on  the  grounds  that  the  only  change  observed  was  quantitative;  for  oth¬ 
ers,  the  use  of  such  methodology  may  not  be  justified.  Clearly,  further  re¬ 
search  is  needed  to  define  those  areas  where  the  use  of  change  scores  or  their 
derivatives may be warranted.


Appendix: Supplementary Tables


Table A

Frequency Distributions of Number-Correct Scores
for APT Pretest and Posttest (N = 220)

                  Pretest                           Posttest
Score  Frequency  Percent  Cum. Percent  Frequency  Percent  Cum. Percent

   35          0      0.0         100.0          4      1.8         100.0
   34          1      0.5         100.0         20      9.1          98.2
   33          4      1.8          99.5         28     12.7          89.1
   32          7      3.2          97.7         29     13.2          76.4
   31          7      3.2          94.5         19      8.6          63.2
   30         13      5.9          91.4         25     11.4          54.5
   29          5      2.3          85.5         16      7.3          43.2
   28         13      5.9          83.2         19      8.6          35.9
   27          5      2.3          77.3         11      5.0          27.3
   26          8      3.6          75.0          8      3.6          22.3
   25         14      6.4          71.4          7      3.2          18.6
   24         20      9.1          65.0          7      3.2          15.5
   23         17      7.7          55.9          6      2.7          12.3
   22         10      4.5          48.2          1      0.5           9.5
   21         14      6.4          43.6          5      2.3           9.1
   20         16      7.3          37.3          4      1.8           6.8
   19          6      2.7          30.0          1      0.5           5.0
   18         11      5.0          27.3          3      1.4           4.5
   17         11      5.0          22.3          1      0.5           3.2
   16          9      4.1          17.3          0      0.0           2.7
   15          7      3.2          13.2          0      0.0           2.7
   14          2      0.9          10.0          1      0.5           2.7
   13          4      1.8           9.1          2      0.9           2.3
   12          4      1.8           7.3          1      0.5           1.4
   11          7      3.2           5.5          0      0.0           0.9
   10          1      0.5           2.3          1      0.5           0.9
    9          0      0.0           1.8          0      0.0           0.5
    8          3      1.4           1.8          1      0.5           0.5
    7          1      0.5           0.5          0      0.0           0.0

Mean                22.26                             28.91
SD                   5.97                              4.88
Median              22.74                             30.10
Mode                   24                                32
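
The summary statistics for the pretest column of Table A can be recomputed from the frequencies alone. The sketch below is illustrative only: the report does not state which formulas were used, and the interpolated (grouped-data) median shown here is an assumption that happens to agree with the printed pretest values. The function names are hypothetical.

```python
# Pretest frequencies from Table A for scores 35 down to 7 (N = 220).
# Assumption: scores run consecutively from 35 down to 7, one frequency per score.
PRETEST_FREQ = [0, 1, 4, 7, 7, 13, 5, 13, 5, 8, 14, 20, 17, 10, 14,
                16, 6, 11, 11, 9, 7, 2, 4, 4, 7, 1, 0, 3, 1]
TOP_SCORE = 35


def score_freq_pairs(freqs, top_score):
    """Pair each frequency with its score, highest score first."""
    return [(top_score - i, f) for i, f in enumerate(freqs)]


def mean_score(pairs):
    """Arithmetic mean of the scores described by (score, frequency) pairs."""
    n = sum(f for _, f in pairs)
    return sum(score * f for score, f in pairs) / n


def interpolated_median(pairs):
    """Median with each integer score x treated as the interval [x - 0.5, x + 0.5]."""
    n = sum(f for _, f in pairs)
    half = n / 2.0
    below = 0  # number of examinees scoring below the current score
    for score, f in sorted(pairs):
        if f > 0 and below + f >= half:
            return (score - 0.5) + (half - below) / f
        below += f
    raise ValueError("frequency table is empty")


pairs = score_freq_pairs(PRETEST_FREQ, TOP_SCORE)
print(round(mean_score(pairs), 2))           # 22.26
print(round(interpolated_median(pairs), 2))  # 22.74
```

Agreement with the printed mean and median is a useful check on the transcription of the frequency column, since a misread frequency would usually shift one or both values.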

Table B

Eigenvalues and Percent of Total Variance
Accounted for by First 15 Factors Extracted from the APT
at Pretest and at Posttest, and from Corresponding Random Data

                    Pretest                             Posttest
             APT           Random Data           APT           Random Data
        Eigen-   % Total  Eigen-   % Total  Eigen-   % Total  Eigen-   % Total
Factor   value  Variance   value  Variance   value  Variance   value  Variance

     1   5.350      15.3   1.545       4.4   5.590      16.0   1.419       4.1
     2   1.555       4.4   1.308       3.7   1.605       4.6   1.253       3.6
     3   1.539       4.4   1.229       3.5   1.337       3.8   1.161       3.3
     4   1.209       3.5   1.139       3.3   1.171       3.3   1.134       3.2
     5   1.086       3.1   1.029       2.9   1.034       3.0   1.052       3.0
     6   1.016       2.9    .993       2.8   1.006       2.9   1.023       2.9
     7    .942       2.7    .890       2.5    .986       2.8    .896       2.6
     8    .892       2.5    .865       2.5    .939       2.7    .828       2.4
     9    .876       2.5    .822       2.3    .839       2.4    .814       2.3
    10    .794       2.3    .767       2.2    .797       2.3    .790       2.3
    11    .739       2.1    .745       2.1    .756       2.2    .770       2.2
    12    .666       1.9    .692       2.0    .675       1.9    .732       2.1
    13    .607       1.7    .634       1.8    .660       1.9    .702       2.0
    14    .597       1.7    .600       1.7    .604       1.7    .666       1.9
    15    .553       1.6    .566       1.6    .533       1.5    .600       1.7
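
The two kinds of columns in Table B are linked by simple arithmetic, sketched below for the first five pretest factors. Two assumptions are made that this appendix does not state: the APT is taken to have 35 items (inferred from the top score in Table A), and the eigenvalues are taken to come from a correlation matrix with unities on the diagonal, so that total variance equals the number of items. The function names are hypothetical, and the difference from the random-data eigenvalue is shown only as one convenient way to read the table, not as the authors' decision rule.

```python
# Eigenvalues copied from Table B (pretest columns, first five factors).
apt_pretest = [5.350, 1.555, 1.539, 1.209, 1.086]
random_pretest = [1.545, 1.308, 1.229, 1.139, 1.029]

N_ITEMS = 35  # assumption: inferred from the maximum score of 35 in Table A


def percent_of_total_variance(eigenvalue, n_items=N_ITEMS):
    """Percent of total variance, assuming total variance equals the item count."""
    return round(100.0 * eigenvalue / n_items, 1)


def excess_over_random(real, rand):
    """Difference between each real-data eigenvalue and its random-data counterpart."""
    return [round(r - q, 3) for r, q in zip(real, rand)]


print([percent_of_total_variance(e) for e in apt_pretest])
# [15.3, 4.4, 4.4, 3.5, 3.1], matching the "% Total Variance" column
print(excess_over_random(apt_pretest, random_pretest))
```

Under these assumptions, the first APT factor exceeds its random-data counterpart by about 3.8 eigenvalue units, while the later factors exceed theirs by much smaller margins.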

Table C

Frequency Distribution of Number-Correct Scores for
Biology Test A at MQ Posttest for Students in Groups 1 and 2

              Group 1 (N = 98)                  Group 2 (N = 91)
Score  Frequency  Percent  Cum. Percent  Frequency  Percent  Cum. Percent

   41          1      1.0         100.0          0      0.0         100.0
   40          0      0.0          99.0          0      0.0         100.0
   39          0      0.0          99.0          0      0.0         100.0
   38          0      0.0          99.0          1      1.1         100.0
   37          2      2.0          99.0          1      1.1          98.9
   36          1      1.0          96.9          0      0.0          97.8
   35          0      0.0          95.9          1      1.1          97.8
   34          1      1.0          95.9          3      3.3          96.7
   33          2      2.0          94.9          1      1.1          93.4
   32          3      3.1          92.9          2      2.2          92.3
   31          2      2.0          89.8          4      4.4          90.1
   30          5      5.1          87.8          1      1.1          85.7
   29          6      6.1          82.7          3      3.3          84.6
   28          4      4.1          76.5          1      1.1          81.3
   27          5      5.1          72.4          6      6.6          80.2
   26          6      6.1          67.3          5      5.5          73.6
   25          6      6.1          61.2          5      5.5          68.1
   24          7      7.1          55.1          2      2.2          62.6
   23         10     10.2          48.0          6      6.6          60.4
   22          7      7.1          37.8          5      5.5          53.8
   21          9      9.2          30.6          6      6.6          48.4
   20          3      3.1          21.4          6      6.6          41.8
   19          5      5.1          18.4          4      4.4          35.2
   18          2      2.0          13.3          9      9.9          30.8
   17          3      3.1          11.2          5      5.5          20.9
   16          1      1.0           8.2          5      5.5          15.4
   15          1      1.0           7.1          3      3.3           9.9
   14          1      1.0           6.1          1      1.1           6.6
   13          1      1.0           5.1          1      1.1           5.5
   12          2      2.0           4.1          1      1.1           4.4
   11          1      1.0           2.0          2      2.2           3.3
   10          0      0.0           1.0          1      1.1           1.1
    9          1      1.0           1.0          0      0.0           0.0

Mean                24.19                             22.59
SD                   5.87                              6.26
Median              23.79                             21.80
Mode                   23                                18
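
The percent and cumulative-percent columns in these frequency tables follow directly from the frequency column. As a minimal sketch, the Group 1 frequencies from Table C are converted below, with cumulative percent read as the percentage of students scoring at or below a given score. That reading is an assumption based on the printed values, and the function name is hypothetical.

```python
# Group 1 frequencies from Table C for scores 41 down to 9 (N = 98).
GROUP1_FREQ = [1, 0, 0, 0, 2, 1, 0, 1, 2, 3, 2, 5, 6, 4, 5, 6, 6, 7,
               10, 7, 9, 3, 5, 2, 3, 1, 1, 1, 1, 2, 1, 0, 1]
TOP_SCORE = 41


def frequency_rows(freqs, top_score):
    """Return (score, frequency, percent, cumulative percent) rows, highest score first.

    Cumulative percent is the percentage of examinees scoring at or below the score.
    """
    n = sum(freqs)
    at_or_below = n  # everyone scores at or below the top score
    rows = []
    for offset, freq in enumerate(freqs):
        score = top_score - offset
        rows.append((score, freq,
                     round(100.0 * freq / n, 1),
                     round(100.0 * at_or_below / n, 1)))
        at_or_below -= freq
    return rows


rows = frequency_rows(GROUP1_FREQ, TOP_SCORE)
for score, freq, pct, cum in rows[:5]:
    print(f"{score:>5}{freq:>11}{pct:>9.1f}{cum:>14.1f}")
```

Rows built this way can be compared against the printed table to catch transcription errors in individual cells.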

Table D

Frequency Distribution of Number-Correct Scores
for Biology Test A at Pretest and at MQ Posttest

              Pretest (N = 272)                 Posttest (N = 283)
Score  Frequency  Percent  Cum. Percent  Frequency  Percent  Cum. Percent

   41          0      0.0         100.0          1      0.4         100.0
   40          0      0.0         100.0          0      0.0          99.6
   39          0      0.0         100.0          0      0.0          99.6
   38          0      0.0         100.0          1      0.4          99.6
   37          0      0.0         100.0          4      1.4          99.3
   36          0      0.0         100.0          2      0.7          97.9
   35          0      0.0         100.0          3      1.1          97.2
   34          0      0.0         100.0          4      1.4          96.1
   33          0      0.0         100.0          5      1.8          94.7
   32          0      0.0         100.0          6      2.1          92.9
   31          0      0.0         100.0          9      3.2          90.8
   30          1      0.0         100.0          8      2.8          87.6
   29          0      0.0          99.6         15      5.3          84.8
   28          1      0.4          99.6          9      3.2          79.5
   27          1      0.4          99.3         17      6.0          76.3
   26          0      0.0          98.9         16      5.7          70.3
   25          2      0.7          98.9         23      8.1          64.7
   24          5      1.8          98.2         15      5.3          56.5
   23          8      2.9          96.3         24      8.5          51.2
   22          6      2.2          93.4         15      5.3          42.8
   21          8      2.9          91.2         19      6.7          37.5
   20          9      3.3          88.2         14      4.9          30.7
   19         25      9.2          84.9         10      3.5          25.8
   18         23      8.5          75.7         16      5.7          22.3
   17         34     12.5          67.3         13      4.6          16.6
   16         23      8.5          54.8          9      3.2          12.0
   15         24      8.8          46.3          7      2.5           8.8
   14         30     11.0          37.5          5      1.8           6.4
   13         25      9.2          26.5          3      1.1           4.6
   12         13      4.8          17.3          3      1.1           3.5
   11         15      5.5          12.5          4      1.4           2.5
   10          7      2.6           7.0          1      0.4           1.1
    9          5      1.8           4.4          2      0.7           0.7
    8          3      1.1           2.6          0      0.0           0.0
    7          3      1.1           1.5          0      0.0           0.0
    6          0      0.0           0.4          0      0.0           0.0
    5          0      0.0           0.4          0      0.0           0.0
    4          1      0.4           0.4          0      0.0           0.0

Mean                15.97                             23.46
SD                   3.97                              5.99
Median              15.94                             23.35
Mode                   17                                23


Table  E 

Eigenvalues  and  Percent  of  Total  Variance  Accounted  for  by 
First  15  Factors  Extracted  from  Biology  Test  A  at  Pretest 
and  at  MQ  Posttest  and  Corresponding  Random  Data 

Table F

Frequency Distribution of Number-Correct Scores
for Biology Test B at Pretest and at Final Exam Posttest

              Pretest (N = 277)                 Posttest (N = 163)
Score  Frequency  Percent  Cum. Percent  Frequency  Percent  Cum. Percent

   33          0      0.0         100.0          1      0.6         100.0
   32          1      0.4         100.0          2      1.2          99.4
   31          0      0.4         100.0          3      1.8          98.2
   30          0      0.4         100.0          1      0.6          96.3
   29          0      0.4         100.0          5      3.1          95.7
   28          0      0.4         100.0          8      4.9          92.6
   27          0      0.4         100.0          8      4.9          87.7
   26          0      0.0         100.0          6      3.7          82.8
   25          1      0.4          99.6          5      3.1          79.1
   24          2      0.7          99.3          8      4.9          76.1
   23          4      1.4          98.6         13      8.0          71.2
   22          4      1.4          97.1         16      9.8          63.2
   21          6      2.2          95.7         17     10.4          53.4
   20         10      3.6          93.5         15      9.2          42.9
   19         12      4.3          89.9         10      6.1          33.7
   18         27      9.7          85.6         12      7.4          27.6
   17         31     11.2          75.8         10      6.1          20.2
   16         29     10.5          64.6         10      6.1          14.1
   15         30     10.8          54.2          5      3.1           8.0
   14         29     10.5          43.3          3      1.8           4.9
   13         23      8.3          32.9          3      1.8           3.1
   12         22      7.9          24.5          0      0.0           1.2
   11         21      7.6          16.6          2      1.2           1.2
   10         16      5.8           9.0          0      0.0           0.0
    9          7      2.5           3.2          0      0.0           0.0
    8          2      0.7           0.7          0      0.0           0.0

Mean                15.18                             21.47
SD                   3.54                              4.58
Median              15.12                             21.18
Mode                   17                                21


Table G

Eigenvalues and Percent of Total Variance Accounted for by First
15 Factors Extracted from Biology Test B at Pretest and at Final Exam
Posttest and from Corresponding Random Data

                    Pretest                       Final Exam Posttest
            Test B         Random Data          Test B         Random Data
        Eigen-   % Total  Eigen-   % Total  Eigen-   % Total  Eigen-   % Total
Factor   value  Variance   value  Variance   value  Variance   value  Variance

     1   2.043       4.9   2.440       5.8   3.124       7.4   1.810       4.3
     2   1.551       3.7   1.448       3.4   1.920       4.6   1.678       4.0
     3   1.345       3.2   1.190       2.8   1.590       3.8   1.550       3.7
     4   1.204       2.9   1.146       2.7   1.480       3.5   1.513       3.6
     5   1.152       2.7   1.098       2.7   1.383       3.3   1.466       3.5
     6   1.065       2.5   1.053       2.5   1.309       3.1   1.370       3.3
     7    .932       2.2    .999       2.4   1.284       3.1   1.305       3.1
     8    .911       2.2    .929       2.2   1.167       2.8   1.234       2.9
     9    .887       2.1    .920       2.2   1.151       2.7   1.215       2.9
    10    .835       2.0    .852       2.0   1.059       2.5   1.105       2.6
    11    .796       1.9    .770       1.8    .978       2.3   1.030       2.5
    12    .781       1.9    .739       1.8    .964       2.3    .966       2.3
    13    .747       1.8    .702       1.7    .927       2.2    .895       2.1
    14    .709       1.7    .684       1.6    .911       2.2    .857       2.0
    15    .685       1.6    .668       1.6    .819       2.0    .803       1.9